Refactor crypto functions code #18664
| "computes blake3 hash digest of the given input" | ||
| ); | ||
| macro_rules! digest_to_scalar { |
Moved down, closer to where it's actually used
| /// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]. | ||
| /// Second argument is the algorithm to use. | ||
| /// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512. | ||
| pub fn digest(args: &[ColumnarValue]) -> Result<ColumnarValue> { |
Moved to digest.rs as that is the only place it is used
| } | ||
| s | ||
| } | ||
| pub fn utf8_or_binary_to_binary_type( |
Refactored away, was only used for return types in sha/md5/digest (which are now simplified)
| /// digest a binary array to their hash values | ||
| pub fn digest_binary_array<T>(self, value: &dyn Array) -> Result<ColumnarValue> | ||
| where | ||
| T: OffsetSizeTrait, |
Essentially inlined these, as I didn't see much benefit and they were adding more indirection
| } | ||
| }) | ||
| fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> { | ||
| Ok(DataType::Utf8View) |
Why does md5 return Utf8View?
digest and sha return Binary.
Looks like md5 hex-encodes its output, so it returns a string, whereas the other digest functions return the raw binary output as is
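To illustrate the distinction in the reply above, here is a minimal sketch of hex-encoding raw digest bytes into text using only the standard library. The helper name `to_hex` and the sample bytes are assumptions for illustration; this is not the code DataFusion uses for md5.

```rust
use std::fmt::Write;

/// Hypothetical helper: hex-encode raw digest bytes into a lowercase string.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // Writing into a String cannot fail, so unwrapping is safe here.
        write!(out, "{:02x}", b).unwrap();
    }
    out
}

fn main() {
    // Illustrative input: four arbitrary bytes standing in for a raw digest.
    let raw: [u8; 4] = [0xd4, 0x1d, 0x8c, 0xd9];
    let text = to_hex(&raw);
    // The raw form is binary data; the hex form is text and twice as long.
    println!("{} raw bytes -> {} hex chars: {}", raw.len(), text.len(), text);
}
```

A function that returns the encoded form naturally has a string return type, while one that returns the bytes unchanged has a binary return type.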
| }; | ||
| use arrow::array::{AsArray, GenericStringArray, StringViewArray}; | ||
| use arrow::array::{Array, ArrayRef, BinaryArray, BinaryArrayType}; | ||
| use arrow::array::{AsArray, StringViewArray}; |
nit: The two lines above could be merged.
Fixed the imports (wish rustfmt could handle this for us 😔 )
| /// Digest computes a binary hash of the given data, accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]. | ||
| /// Second argument is the algorithm to use. | ||
| /// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512. |
| /// Standard algorithms are md5, sha1, sha224, sha256, sha384 and sha512. | |
| /// Standard algorithms are md5, sha224, sha256, sha384 and sha512. |
sha1 is not supported in sha.rs, and since it is too weak, it should not be advertised.
Fixed the docstring; we didn't even support sha1 for this method anyway so seems it was outdated.
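As a rough sketch of why the docstring should list only the names that are actually accepted, the following shows one way an algorithm name could be parsed and an unsupported one such as sha1 rejected. The `Algo` enum and its error type are hypothetical and cover only the names in the corrected docstring; they are not DataFusion's actual types.

```rust
use std::str::FromStr;

/// Hypothetical enum covering only the names listed in the docstring.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Algo {
    Md5,
    Sha224,
    Sha256,
    Sha384,
    Sha512,
}

impl FromStr for Algo {
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "md5" => Ok(Algo::Md5),
            "sha224" => Ok(Algo::Sha224),
            "sha256" => Ok(Algo::Sha256),
            "sha384" => Ok(Algo::Sha384),
            "sha512" => Ok(Algo::Sha512),
            // Anything else, including "sha1", is reported as unsupported.
            other => Err(format!("unsupported digest algorithm: {}", other)),
        }
    }
}

fn main() {
    println!("{:?}", "sha256".parse::<Algo>());
    println!("{:?}", "sha1".parse::<Algo>());
}
```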
| ScalarValue::Binary($INPUT.as_ref().map(|v| { | ||
| let mut digest = $METHOD::default(); | ||
| digest.update(v); | ||
| #[allow(deprecated)] |
I tried removing this #[allow(deprecated)] annotation locally and clippy still passed
| #[allow(deprecated)] |
Thanks for checking, fixed; I think last I tested one of the sha libs was causing a deprecation warning here, so I guess it's been updated since then 🎉
| +------------------------------------------+ | ||
| | <binary_hash_result> | | ||
| +------------------------------------------+ | ||
| +------------------------------------------------------------------+ |
I double checked this is indeed the output
DataFusion CLI v51.0.0
> select digest('foo', 'sha256');
+------------------------------------------------------------------+
| digest(Utf8("foo"),Utf8("sha256")) |
+------------------------------------------------------------------+
| 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae |
+------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.030 seconds.
| pub mod sha256; | ||
| pub mod sha384; | ||
| pub mod sha512; | ||
| pub mod sha; |
I think this is technically a breaking API change, but it is not likely to cause major problems. I tagged this PR as an API change so it shows up in the release notes.
Which issue does this PR close?
N/A
Rationale for this change
Deduplicate & simplify code in the crypto functions.
What changes are included in this PR?
Fold Sha224/Sha256/Sha384/Sha512 into a common struct.
Cleanup signature & return types.
Simplify code in datafusion/functions/src/crypto/basic.rs
Are these changes tested?
Existing tests.
Are there any user-facing changes?
Some public methods were removed, though I don't believe they were intended to be used outside of other DataFusion crates.
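
For readers unfamiliar with the pattern, folding Sha224/Sha256/Sha384/Sha512 into a common struct could look roughly like the sketch below. The names `ShaVariant` and `ShaFunc` are hypothetical and the hashing itself is omitted; only the SQL-facing name and the digest length in bytes are modeled. This is not the struct introduced by the PR.

```rust
/// Hypothetical variant selector for the SHA-2 family.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ShaVariant {
    Sha224,
    Sha256,
    Sha384,
    Sha512,
}

/// Hypothetical single struct standing in for four near-identical ones.
#[derive(Debug)]
struct ShaFunc {
    variant: ShaVariant,
}

impl ShaFunc {
    fn new(variant: ShaVariant) -> Self {
        Self { variant }
    }

    /// Function name as it would be written in a query.
    fn name(&self) -> &'static str {
        match self.variant {
            ShaVariant::Sha224 => "sha224",
            ShaVariant::Sha256 => "sha256",
            ShaVariant::Sha384 => "sha384",
            ShaVariant::Sha512 => "sha512",
        }
    }

    /// Digest length in bytes for each SHA-2 variant.
    fn digest_len(&self) -> usize {
        match self.variant {
            ShaVariant::Sha224 => 28,
            ShaVariant::Sha256 => 32,
            ShaVariant::Sha384 => 48,
            ShaVariant::Sha512 => 64,
        }
    }
}

fn main() {
    for v in [
        ShaVariant::Sha224,
        ShaVariant::Sha256,
        ShaVariant::Sha384,
        ShaVariant::Sha512,
    ] {
        let f = ShaFunc::new(v);
        println!("{} -> {} bytes", f.name(), f.digest_len());
    }
}
```

With this shape, the per-variant differences live in two `match` expressions instead of four separate modules.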